Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gix-dir: Walk #1252

Merged
merged 15 commits into from
Feb 11, 2024
Merged

gix-dir: Walk #1252

merged 15 commits into from
Feb 11, 2024

Conversation

Byron
Copy link
Member

@Byron Byron commented Jan 17, 2024

Based on #1219


diff-correctness → gix-status → gix reset


Improve gix status to the point where it's suitable for use in reset functinoality.
Leads to a proper worktree reset implementation, eventually leading to a high-level reset similar to how git supports it.

Architecture

The reason this PR deals quite a bit with gix status is that for a safe implementation of reset() we need to be sure that the files we would want to touch don't don't carry modifications or are untracked files. In order to know what would need to be done, we have to diff the current-index with target-index. The set of files to touch can then be used to lookup information provided by git-status, like worktree modifications, index modifications, and untracked files, to know if we can proceed or not. Here is also where the reset-modes would affect the outcome, i.e. what to change and how.

This is a very modular approach which facilitates testing and understanding of what otherwise would be a very complex algorithm. Having a set of changes as output also allows to one day parallelize applying these changes.

This leaves us in a situation where the current checkout() implementation wants to become a fastpath for situations where the reset involves an empty tree as source (i.e. create everything and overwrite local changes).

On the way to reset() it's a valid choice to warm up more with the matter by improving on the current gix status implementation and assure correctness of what's there, which currently doesn't seem to be the case in comparison. Further, implementing gix status similarly to git status should be made possible.

Research of Walk

  • Integrate with the untracked-file extension and provide a way to update it.
  • It looks like this wants to be directory traversal with index, untracked and ignored support, as well as empty-dir and .git dir/file handling
  • Should also include pathspec-based filtering, and looks like this is also useful for gix add then
  • ignore index untracked-cache and name-hash - these are optimizations. At the current time, we don't even write the untracked-cache back so it's OK to ignore. Also Git doesn't use it by default and it can crash if enabled.
    • Note that for WebKit this optimization is very important, even though we might be able to have some easy wins with parallelization of dirwalk + modification check, as mod-check takes 1.8s at best, and dirwalk takes 1.5s single-threaded.
    • It turns out that WebKit traversal takes 0.5s at best (without any other processing), and 0.23s with optimal thread count. So it seems worth going for jwalk enablement.
  • special attention needs to be paid to case-insensitive file systems (and looking up their path in the index)
  • needs to deal with precompose-unicode settings
  • let's have statistics right away
  • let's do a custom traversal for most control (i.e. not an iterator for now, need correctness, can easily create threaded iter from it if needed)
  • symlink handling - assure we don't follow symlinks (I presume?)
  • assure CWD doesn't have any say in the outcome
  • see what it returns for submodules
  • see what it does with cone-mode partial checkouts, or other forms of partial checkouts

Tasks

  • root tracked/untracked
  • root ignored
  • root pathspec exclude
  • pathspec handles directory properly
  • detect nested repositories in root (and deal with non-standard worktrees/gitdir combos)
  • fine-grained control over what is recursed into - need trait, as we might need submodule access, pathspecs lead the way.
  • sparse-checkout
  • actual traversal
  • traversal with change CWD
  • hide empty dirs
  • only show directory as ignored if there are only ignored inside, and as untracked if there are only ignored and untracked. Actually ignored don't matter, just tracked ones do.
    • assure we get aggregation and pruning right. See if stats can be removed.
    • pathspecs can make individual items appear that otherwise would be aggregated (i.e. dir/*). It's probably about being exact matches and what not, also works with prefix-matches like dir/ or dir (but it knows how much matched to know what to return, files or the containing directory).
    • when pathspecs match exactly, parent directories can't be collapsed anymore. Probably the opposite can be done: pruned entries are kept around. Maybe that solves the above issue as well.
    • also figure out how pathspecs are used when determining if recursion is allowed - can it force recursion? - postponed
    • motivate implementing directory icase access
  • simple gix clean
    • case-insensitive index search is broken - fix it (use hashbrown
    • assure pathspecs (leading into a directory) cause us to start the traversal early, but check if git actually does that effectively (check readdir counts) - Git only uses the prefix, like us. That leads to at least one additional readdir call, fair enough.
    • clean -nxd '*.o' should only list matching files, not collapse directories (in delete mode)
    • using pathspecs to guide into ignored directories doesn't work past the ignored boundary, also, see generated/*.o
    • using pathspecs to guide into ignored directories seems to have limits, can't go too deep, even though it's better already
    • pathspec clean -nx d/d matches and forces the directory to be shown even though no -d is given
    • it lists precious empty directories within precious directories that are to be removed
    • git skips nested directories (in ignored dirs) by default, can we do that too?
    • :!*generated* seems to expand to much
    • actually implement --execute
  • walkdir support in gix
  • gix clean uses walkdir from gix
  • gix-dir also hides bare repositories
  • find nested bare repos

Next PR

  • minimal integration of dirwalk() into gix status to learn more about it (and its performance)
  • Integrate untracked and ignored files with rename tracking of index-worktree diffs
  • status in gix crate with index-worktree
  • diff index with index to learn what we would want to do in the worktree, or alternatively,
    diff tree with index (with reverse-diff functionality to simulate diff of index with tree), for better performance as it
    would avoid having to allocate a whole index even though we are only interested in a diff.
    • Must include rename tracking.
  • how to make diff results available from status with all transformations applied, to allow user to obtain diffs of any kind?

Benchmarks

gix can hold its own and comes in close to Git, but only if it doesn't do any additional work by means of precomposing unicode or building an ignore-case hash table (which Git always builds). Interestingly, Git doesn't get slowed at all by any of these, which is surprising as precompose unicode costs about 20% of the time in our case.
Multithreading the hashtable generation seems to be the way to go.

Data
WebKit ( main) via △
❯ hyperfine -N -r4 -w1 'git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean'
Benchmark 1: git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n
  Time (mean ± σ):     676.7 ms ±  15.2 ms    [User: 223.0 ms, System: 473.8 ms]
  Range (min … max):   656.4 ms … 689.4 ms    4 runs

Benchmark 2: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean
  Time (mean ± σ):     696.1 ms ±  11.5 ms    [User: 233.2 ms, System: 481.4 ms]
  Range (min … max):   679.2 ms … 703.9 ms    4 runs

Benchmark 3: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean
  Time (mean ± σ):     821.3 ms ±   7.5 ms    [User: 360.8 ms, System: 482.3 ms]
  Range (min … max):   811.4 ms … 828.7 ms    4 runs

Benchmark 4: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean
  Time (mean ± σ):     893.7 ms ±   9.1 ms    [User: 438.2 ms, System: 473.1 ms]
  Range (min … max):   885.5 ms … 905.1 ms    4 runs

Benchmark 5: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean
  Time (mean ± σ):     772.1 ms ±  10.2 ms    [User: 308.0 ms, System: 485.2 ms]
  Range (min … max):   761.6 ms … 785.7 ms    4 runs

Summary
  git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n ran
    1.03 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean
    1.14 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean
    1.21 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean
    1.32 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean

WebKit ( main) via △ took 19s
❯ hyperfine -N -r4 -w1 'git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n' 'git  -c core.precomposeUnicode=1 -c core.ignorecase=0  clean -n' 'git  -c core.precomposeUnicode=1 -c core.ignorecase=1  clean -n' 'git  -c core.precomposeUnicode=0 -c core.ignorecase=1  clean -n' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean'
Benchmark 1: git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n
  Time (mean ± σ):     667.7 ms ±   5.2 ms    [User: 220.0 ms, System: 465.4 ms]
  Range (min … max):   663.0 ms … 673.2 ms    4 runs

Benchmark 2: git  -c core.precomposeUnicode=1 -c core.ignorecase=0  clean -n
  Time (mean ± σ):     690.0 ms ±  23.2 ms    [User: 231.7 ms, System: 476.2 ms]
  Range (min … max):   663.0 ms … 709.9 ms    4 runs

Benchmark 3: git  -c core.precomposeUnicode=1 -c core.ignorecase=1  clean -n
  Time (mean ± σ):     667.0 ms ±  15.6 ms    [User: 230.1 ms, System: 480.8 ms]
  Range (min … max):   657.1 ms … 690.0 ms    4 runs

Benchmark 4: git  -c core.precomposeUnicode=0 -c core.ignorecase=1  clean -n
  Time (mean ± σ):     657.3 ms ±  12.5 ms    [User: 225.8 ms, System: 477.7 ms]
  Range (min … max):   646.6 ms … 672.7 ms    4 runs

Benchmark 5: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean
  Time (mean ± σ):     699.1 ms ±  23.6 ms    [User: 234.3 ms, System: 483.4 ms]
  Range (min … max):   678.7 ms … 730.3 ms    4 runs

Benchmark 6: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean
  Time (mean ± σ):     842.9 ms ±  49.3 ms    [User: 365.1 ms, System: 491.1 ms]
  Range (min … max):   815.7 ms … 916.9 ms    4 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 7: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean
  Time (mean ± σ):     897.8 ms ±  30.3 ms    [User: 429.6 ms, System: 481.0 ms]
  Range (min … max):   870.2 ms … 924.0 ms    4 runs

Benchmark 8: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean
  Time (mean ± σ):     769.9 ms ±  24.3 ms    [User: 298.3 ms, System: 490.2 ms]
  Range (min … max):   744.0 ms … 795.8 ms    4 runs

Summary
  git  -c core.precomposeUnicode=0 -c core.ignorecase=1  clean -n ran
    1.01 ± 0.03 times faster than git  -c core.precomposeUnicode=1 -c core.ignorecase=1  clean -n
    1.02 ± 0.02 times faster than git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n
    1.05 ± 0.04 times faster than git  -c core.precomposeUnicode=1 -c core.ignorecase=0  clean -n
    1.06 ± 0.04 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean
    1.17 ± 0.04 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean
    1.28 ± 0.08 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean
    1.37 ± 0.05 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean

Status Enables

Next PR: Reset

  • reset() that checks if it's allowed to perform a worktree modification is allowed, or if an entry should be skipped. That way we can postpone safety checks like --hard

Postponed

What follows is important for resets, but won't be needed for cargo worktree resets.

  • a way to expand sparse dirs (but figure out if this is truly always necessary) - probably not, unless sparse dirs can be empty, but even then no expansion is needed
    • wire it up in gix index entries to optionally expand sparse entries
  • gix status with actual submodule support - needs status in gix (crate) effectively
  • gix status with actual conflict support

Research

  • Ignored files are considered expandable and can be overwritten on reset
  • How to integrate submodules - probably easy to answer once gix status can deal a little better with submodules. Even though in this case a lot of submodule-related information is needed for a complete reset, probably only doable by a higher-level caller which orchestrates it.
  • How to deal with various modes like merge and keep? How to control refresh? Maybe partial (only the files we touch), and full, to also update the files we don't touch as part of status? Maybe it's part of status if that is run before.
  • Worthwhile to make explicit the difference between git reset and git checkout in terms of HEAD modifications. With the former changing HEADs referent, and the latter changing HEAD itself.
  • figure out how this relates to the current checkout() method as technically that's a reset --hard with optional overwrite check. Could it be rolled into one, with pathspec support added?
    • just keep them separate until it's clear that reset() performs just as well, which is unlikely as there is more overhead. But maybe it's not worth to maintain two versions over it. But if so, one should probably rename it.
  • for git status: what about rename tracking? It's available for tree-diffs and quite complex on its own. Probably only needs HEAD-vs-index rename tracking. No, also can have worktree rename tracking, even though it's hard to imagine how this can be fast unless it's tightly integrated with untracked-files handling. This screams for a generalization of the tracking code though as the testing and implementation is complex, but should be generalisable.

Re-learn

  • pathspecs normalize themselves to turn from any kind of specification into repo-root relative patterns.
  • attribute/ignore file sources are naturally relative to the root of the repo, which remains relative (i.e. can be .. and that root will be always be used to open files like ../.gitignore, which is useful for display to the user)

@pascalkuthe
Copy link
Contributor

For the parallelism: Another rating you can consider is a workstealking executed (basically rayon nut I believe you said you already have one in core). That will call the equivalent of rayon_core:Scope::spawn for every dir entry and update any state.

You can abstract the join function to switch between single threaded and multi threaded.

Then you can just transverse the directories with std and have full manual control.

@Byron
Copy link
Member Author

Byron commented Jan 25, 2024

Thanks for sharing, I am glad you think this can be parallelised!

I was thinking to have the following progression:

  • make it work with code that quite closely resembles the source material (even though it already diverges as it's a library)
  • make it work correctly and assure there are tests for every little thing that is important

The above would already satisfy as the integration into the top-level status call will then run a bunch of different status-tasks in parallel and we'll see what that does to the overall performance (after all, there are 3 different status operations needed to get what git status does, they can all run in parallel).

I already checked the potential for parallelisation on the WebKit repository, and on my machien it was close to 2X faster when having a parallel walk, merely by comparing jwalk and walkdir. Not great, but meaningful.

The simplest way, in theory, would be to see if the algorithm in all its complexity can work with an iterator, as then one could just gix-features::fs::WalkDir for it which is parallel with the right cargo feature.

Of course, it would be more optimal to use parallelise directly, and but at least for now I am so ignorant towards rayon and for that reason along wouldn't want to use it directly. Maybe that changes one day though. moonwalk of course is also a contender as it would represent such an abstraction.

Long story short, I definitely plan for future upgrades and will do my best to have a bullet-proof test-suite that assures everything still works even after massive refactors that might come with these changes.

@Byron
Copy link
Member Author

Byron commented Jan 25, 2024

Just now I realised that the aforementioned 3 steps have interdependencies, as the index-to-worktree comparison is required to run before the walk() can happen, because it needs the index entries to have the gix_index::entry::Flags::UPTODATE flag set or else the entry will be ignored. This is a bit sad as that comparison is the most costly.
Maybe something smart can be done in due time that involves assuming all entries are up-to-date, and then re-running the walk() on just the paths that we know changed in the meantime., Oh well, kommt Zeit kommt Rat.

@pascalkuthe
Copy link
Contributor

Hmm I didn't match the git implementation quite as closely when I worked on the status stuff but it did seems like this was more of a superficial limitation of the implementation rathee than an actual interdependence.

Doing serial first makes lots of sense ofcourse.

@Byron
Copy link
Member Author

Byron commented Jan 25, 2024

Hmm I didn't match the git implementation quite as closely when I worked on the status stuff but it did seems like this was more of a superficial limitation of the implementation rathee than an actual interdependence.

It changed quite a bit since then and doesn't have mutable access to the index anymore either, instead it emits change requests (stat updates) specifically. Thus the one orchestrating these calls must know about this flag and set it before calling the dirwalk. Effectively, having this allows for an optimisation as otherwise another lstat call will be triggered.

At least that's my current understanding, thus far I think it will all be fine and optimisation comes later. Even now I am very optimistic that the implementation here will be en-par with Git, while git reset will be much faster on big trees.

@Byron Byron force-pushed the dirwalk branch 9 times, most recently from a9ece59 to b4691eb Compare January 28, 2024 20:04
@Byron Byron force-pushed the dirwalk branch 8 times, most recently from 84e1b5f to 4ed507e Compare February 2, 2024 20:02
@Byron Byron force-pushed the dirwalk branch 4 times, most recently from cf311e3 to 109505a Compare February 8, 2024 16:29
That way it's easier to use them without adding own dependnecies.
This way it's possible to match partial input against a pathspec
to see if this root would have a chance to actually match.
…atch`.

With it the caller can learn how or why the pathspec matched,
which allows to make decisions based on it that are relevant
to the user interface.
@Byron Byron force-pushed the dirwalk branch 2 times, most recently from 85b30d3 to f3aea6a Compare February 10, 2024 20:13
The new implementation does the same as Git, and initializes
an alternative lookup location.

All previous implementations with `_icase` suffix were removed
without replacement.
…ns were available.

These are only relevant when reading, and have to be regenerated after writing.
During reading, they can speed up performance by allowing multi-threading.
@Byron Byron force-pushed the dirwalk branch 3 times, most recently from 8d413ba to fa1eda0 Compare February 11, 2024 14:25
That way it's possible to obtain the current working directory
with which the repository was opened.
That way it's possible to perform arbitrary directory walks,
useful for status, clean, and add.
@Byron Byron force-pushed the dirwalk branch 2 times, most recently from 9a11d02 to 9f9b0d7 Compare February 11, 2024 17:24
@Byron Byron merged commit face359 into main Feb 11, 2024
18 checks passed
@Byron Byron mentioned this pull request Feb 11, 2024
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants